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Abstract. This paper introduces a probability density estimator based on Green's function 
identities. A density model is constructed under the sole assumption that the probability density 
is differentiable. The method is implemented as a binary likelihood estimator for classification 
purposes, so issues such as mis-modeling and overtraining are also discussed. The identity 
behind the density estimator can be interpreted as a real-valued, non-scalar kernel method 
which is able to reconstruct differentiable density functions. 



1. Theory of density estimation 

Generally, a density estimator is an identity operator for the probability density function. If it 
is a linear operator, then it is an approximation of the Dirac-delta function: 



Fixed kernel methods usually estimate the density with a bias, since the kernel does not converge 
to the delta function. The same can be seen on histograms with fixed bin sizes, and to make them 
unbiased certain adaptivity is necessary. Different kernels usually maintain different features of 
the fet, such as differentiability or easy recognizability of peaks [1]. Non-adaptive methods 
are the estimation from the moments via Fourier integrals [2] or the estimations that target 
the integral probability density. A problem with the methods using Fourier integrals, is that 
they may require computationally expensive volume integrals, and the result is not necessarily 
a density function until sufficient number of moments are known. 

To avoid the complexity of finding the global minimum of the error function, most of the 
overtraining problems and heavy volume integrals, we tried to define a successive adaptive 
kernel method to estimate the probability densities, using a kernel that is suppressed at large 
distances. The starting point is, that functionals like linear operators are preferred, as they can 
be successfully approximated with a sample of a distribution, because they act like expectation 
values of random variables. 
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where L is a linear operator, p is a parameter vector and n(x) is a probability measure. To have 
such an estimator for the probability density itself, an identity operator is needed: 



H{x) = = P (/ L(x,x')p(x')dn(x'] 



where P is a necessary post-processing to convert the linear operator's output to a scalar. 
Identity operators can be constructed using Green functions of operators. A particularly 
interesting one is the Green function of the Laplace operator in n > 3 dimensions 1 . 

AG(x,x') = 6(x-x') (1) 
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Here S n is a constant, the surface of a unit sphere in n dimensions. For twice differentiable 
functions c(x) that disappear at the boundaries, lim c(x) = 0, the identity operator can be 

constructed: 



V ' S n J \x-x'\ n - 2 

= ^/^ | x _' r -2 -^)dV. 

This can not be used directly for density estimations, since it is linear in c^'c(x') instead of 
c(x'), but an identity can be constructed for the vector field of the first derivative: 

d ^ = i j ^ \ X -l'\n-2 ■ d»>c(x')d n x' . 

R™ 

The amplitude of c^c(x) is still a scalar function; hence for those probability measures, g(x), 
where a c(x) exists 2 with |<9^c(x)| = g(x), there must be at least one (f>n{x) vector field consisting 
of <$> tl <ft i = 1 unit vectors, for which the following identity holds: 

g(x)4>,(x) = 1- I d,d», 1 _ 2 -^(x / )g(x / )dV. (2) 
o n j \x — x | 

Since the integral operator was derived from the Laplace operator, in three dimensions it is 
similar to the dipole interaction operator in electromagnetism, hence 0^ might be called a dipolc 
field. What is important is that this M (x) field can be found without the knowledge of the true 
g(x); only the possibility of calculating the expectation values on the left-hand side of eq. (2) 
is required. In this case it is possible to calculate the left-hand side for any configuration of 
4>n'(x'), until the expectation value E^(x) has the same direction as </> M (x) at every x: 

^(x) || E,{x) = -1 / d^- \— 2 ■ ^(x')g(x') dV . (3) 

o n j \x x | 

R" 

1 Although the Green's function is different for n = 2, the same procedure can be repeated and it gives the same 
form of functions at the end, making method valid in 2 dimensions. 

2 See the subsection targeted at the existence of such a c(x). 



A way to show that E^{x) « g(x)4> ll (x) is to do an integration by parts first: 

E,(x) = j- / d»d», _ 1 .<t>»,(x')g(x') dV = - j- / ^ |-_L-2 -^ dV. 

Tl */ | ' ' | Tt tj | ' ' ■ | 

1" R n 

where the integral on the boundaries is zero, since the derivatives of the probability measures 
are also suppressed at infinity. It is possible to do an external differentiation now, calculating 
the divergence of E^x), and using eq. (1): 

8^(x) = [ —d^- -fy {M^)9&)) ■ dV = (Mx)g{x)) . 

J "J n \x — x i 
R™ y v ' 

AG(x,x>)=6(x-x>) 

For an arbitrary (j)^, E^ and <7<^ can only differ by a divergence-free vector field. In order to 
show \E\ ~ g, a certain measure is needed for the difference. With the square of the difference 
it becomes: 

J (g(x)^(x)-E^x)) 2 d n x>0. (4) 

R n 

Expanding the square, it will have three terms: 

J d n x jj^ -2 g^Ep + E^ . 

R™ invariant dip olc energy field energy 

The field energy can be expressed in terms of the dipole energy. Explicitly writing it out: 



Performing integration by parts on the I m t term, A \ x _l\ n -s appears. Due to eq. (1) this can be 
substituted with a delta function. The remaining part has the derivative of the delta function, 
which can be easily evaluated: 

f 11 

/int. = S n J d n zd x ^5(x - Z)d yv - _ ^ |w _ 2 = S n d yv d x ^ x _^ n _ 2 . 

As a result the field energy equals the dipole energy 

j d n xE 2 = J d n xg^E ii , 

R n R n 

so the eq. (4) can be modified accordingly: 

j (g(x)^(x) - E^(x)f d n x = f d n xg 2 - g^ > . 

R" R« 

The minimum is not necessarily zero, but it can be reached by finding the energy minimum 
of the dipole system. 



1.1. Existence of the optimum 

It was assumed for eq. (2), that an identity operator exists for a given g{x) probability density 
with a special </> M (x) configuration: 

d^c(x) = g(x)(f>, x (x) . 

In other words, such a c(x) has to exist which has a gradient with specified absolute value 
\d^c{x)\ = g{x). This only exists if g(x)4> fM (x) can be made curl free, and all possible loop 
integrals should result in zero: 

L 

As an n-dimensional cube has n(n — 1) faces, from which n(n — l)/2 share the same corner, 
n(n — l)/2 linearly independent infinitesimal loops can be imagined. So the curl in n-dimensions 
may have that many independent variables, which has to be zeroed out by a <p u vector field with 
only n — 1 independent variables. That seems to be non-trivial, but for differentiable g(x), it 
turns out it can be done easily. The curl in question is: 

[curl gWMx)]^ = M (<^)eM - d {v) (g<f>^ = , (5) 

where ej^ denotes the unit basis vector in the v direction, and the bracketed indices are not 
summed up. In the basis e$ = 5^^, the curl is simplified: 

[curl f{x)(f)^x)} uv = d (u) {g<p {v) ) - d {r)) (g<j> {v) ) = 0. 

As only certain derivatives of cf)^ appear, one possibility to anull the curl is to set every 
subcomponent to zero: 

d(u){g<P(u)) = 
d(y)(j)(v) _ d^g 
4>(v) 9 

In this differential equation only the relative change of <p v appears, so setting 4>^ii = 1 at every 
x is still possible. As a consequence, the (p^ vector field exists for any differentiable g, making 
gcpfj, curl-free, and a c(x) function exists with the property 3 M c(x) = g(x)4>^(x), making eq. (2) 
an identity. 



2. Density estimation from a finite sample 

As long as a suitable (p^x) field is found that satisfies eq. (3), it can be used to estimate any 
differentiable probability density function from a sample {xj}: 
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However, since the Green's function has divergences, a finite sample may produce unnecessarily 
large fluctuations. Fortunately, volumes around most of the divergences can be neglected, as 
their integral is zero. A homogeneous g(x) = g, homotropic </>(x) M = 4>^ sphere surface around 
a point x gives no contribution: 



/ \ X -l'\n-2 9 ^' dV = 9 ^' d » j ^' \ X - l X '\n-2 dV 



M" 9 " J - \ x -l>\n~i V dS n dr' = 0. 

S n ' . ' 

const. 



I ^ dS n =0 



As a result, a small sphere of radius AR around the point of interest x can be excluded from the 
summation in eq. (2), since the error will go as d^(g(x)4> v (x))0(AR) for any differentiable g(x) 
and 4>(x). The only exception where the error is not O(AR) is for 5 functions, so these points 
have to be treated independently. 

In order to have control on the fluctuations, one has to choose the AR radius of the exclusion 
volume in such a way that a 5r thick shell around this sphere contains enough sample points 
to estimate the integral belonging to that area. In this case the points in this layer will have 
a dominant contribution to the sum between {-^m, (AR+Sr) n }' s i nce points at larger distances 
will have decreasing values. To have the contribution in the shell to be of the same order of 
magnitude, 5r has to be around AR/n. This approximation is derived from the behaviour of 
the (1 + 1 / n )~ n function, which becomes nearly flat in n, with a value around 1 / / 2- From this 
it follows that if one needs -/V\ ar g e number of points in this stable layer, approximately n • Ai ar g C 
points have to be discriminated, which eventually determines AR. In practice one may choose 
a different Marge for the iterative search of the 4> u (xi) parameters and for the evaluation of the 
density estimation. 

2.1. Finding the parameters 

With a suitable iVdi Scr , the set of {(j)^ = <p^{xi)} that can be used for density estimations can be 
found with iterative search. The stopping criterium can be formulated with discretising eq. (3): 

^ II E^xi) = -3— 9 ^' tz 1 zn ^'j ■ ( 7 ) 

| {k: \Xi-X k | <Rdiscr } | =Ar discr 

As this requirement is similar to looking for the ground state of the dipole system, a way to 
find it is the gradient descent on the energy function. With = E^{xi) this direction is: 

d<p iv = (1 - 4>i n 4>i^)Ei^ , (8) 

which is a rotation with the field in the directions perpendicular to 4>^. A gradient descent 
method typically contains a line search part, but it is not necessary here. It is enough to set 
a limit for the angular change of the an order or magnitude smaller than the maximal 
possible, | d0 max | = 0.1 < n, seems to be a good choice, which ensures that the E n field won't 
change significantly during the iteration, keeping it a valid gradient. Further speed increase 
can be achieved, while taking care not to overshoot the angle between Ei n and (pi n , since this 
is what minimises the energy. However, there might be several local energy minima, but the 
identity in eq. (2) is only true if g<f> is curl free. Following from the inequality in eq. (4) it means 
that the non-global minima solutions are not satisfying the identity and they are not curl free. 
These solutions will hold a bias in the density estimations, which can be relatively big in the 



low density regions. This is because the non-optimal minima have a remnant polarisation, for 
example the destructive interferences do not fully cancel at zero density regions. 

As can be seen on fig. 1, the algorithm actually performs as a density estimator. A sample 
with a size of two thousand points from a two dimensional Gaussian distribution was fed into 
the parameter search, and the density estimation was evaluated in the described manner, at 
the position of sample points. The figure shows the average of the estimated densities for a 
given radius, where the theoretical value is constant, because of the rotational symmetry of the 
chosen Gaussian distribution. The spread was calculated from the variance of the estimations 
on the same radius. The uncertainty can be directly compared with the simplest, kernel-less 
nearest A;-neighbour method [1] in fig. 2. It can be seen, that the uncertainty for the discussed 
density estimator in fig. 1 is approximately half of that from the nearest /c-neighbour method 
with flat kernel in fig. 2. The resolution were chosen to be the same for the two figures, given 
that the exclusion radii, defined in Section 2, were the same as the radii that was used for volume 
calculation in the nearest ^-neighbour method. 




Distance from origin 



Figure 1. Radial density estimate profile of 
a Gaussian distribution placed at the origin. 
The crosses show the spread of the Green's 
function density estimate compared to the 
true value shown as the dashed red curve. 
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Figure 2. Profile plot for the density esti- 
mate made for the same Gaussian sample as 
in fig. 1 made with the nearest /c-neighbour 
method with a flat kernel. 



3. Classification and overtraining 

Although binary classification can be done with binary regression models, their results typically 
depend only on the fraction of the signal s(x) and background density b(x), which is optimal due 
to the Neyman- Pearson Lemma [3]. This is also true for an optimally trained neural network 
with a quadratic loss function, where the response is usually the following [1][4]: 

rNN = 

s(x) + b(x) 

In case the network is not optimally trained, the error of the response tnn can be related to 
the uncertainty of the s(x)/b(x) fraction. The same fraction can be calculated from a density 
estimator, giving a likelihood estimation. The uncertainty in the latter case is much better under 
control, since the calculation of the signal and background densities are made independently. 
For the method described in this article, a density estimation for a single point comes from the 
superposition of the kernels of the most dominant N\ arge sample points, hence by construction 
the likelihood estimation can not come from a single point, as happens during overtraining in 
regressions. 



Figure 3 and fig. 4 show the response distribution for a network trained on a signal consisting 
of twelve non-overlapping Gaussians and a flat background [5]. Unlike in many regression models, 
here there is no need for an independent test sample for an early stopping criterium. Figure 3 
is evaluated on the full training sample, while fig. 4 was made on an independent sample. The 
shape of the two responses are essentially same, the slight difference is at the part where either 
the signal or the background distribution density is very low, since these areas have the largest 
uncertainty in density estimators. 
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Figure 3. Distribution of the response 
function on the 10 4 training sample. 



Figure 4. Distribution of the response on an 
independent 10 5 test sample. 



4. Conclusions 

The article presents a novel class of density estimators, based on Green's function identities. The 
investigated one is derived from the Green's function of the Laplace operator which provides 
meaningful results for once differentiable probability density functions. It is possible to construct 
density estimators from other linear operators, but they are expected to require either higher 
order differentiability or other functional constraints on the investigated probability density. As 
it was shown, such a density estimator can be used for binary classification. In such case, due 
to the handling of the fluctuations, overtraining is not so prominent as it is in binary regression 
models. Therefore there is less need for independent test samples. 
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